Data-Efficient Policy Evaluation Through Behavior Policy Search
Abstract
We consider the task of evaluating a policy for a Markov decision process (MDP). The standard unbiased technique for evaluating a policy is to deploy the policy and observe its performance. We show that the data collected from deploying a different policy, commonly called the behavior policy, can be used to produce unbiased estimates with lower mean squared error than this standard technique. We derive an analytic expression for the optimal behavior policy—the behavior policy that minimizes the mean squared error of the resulting estimates. Because this expression depends on terms that are unknown in practice, we propose a novel policy evaluation sub-problem, behavior policy search: searching for a behavior policy that reduces mean squared error. We present a behavior policy search algorithm and empirically demonstrate its effectiveness in lowering the mean squared error of policy performance estimates.
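The variance reduction described above can be illustrated on a toy problem. The following is a minimal sketch, not the paper's algorithm: it assumes the off-policy estimates are formed by importance sampling (the abstract does not name the estimator), and the two-action, one-step MDP, its rewards, and both policies are hypothetical values chosen for illustration.

```python
import random

# Hypothetical one-step MDP (a two-armed bandit) with deterministic rewards.
REWARDS = [1.0, 10.0]
EVAL_POLICY = [0.9, 0.1]      # the policy whose expected return we want
BEHAVIOR_POLICY = [0.5, 0.5]  # hand-picked for illustration, not an optimal policy


def true_value(policy, rewards):
    """Expected one-step return of `policy`."""
    return sum(p * r for p, r in zip(policy, rewards))


def estimator_moments(eval_policy, behavior_policy, rewards):
    """Exact mean and variance of a single importance-weighted return.

    Actions are drawn from `behavior_policy`; each reward is reweighted by
    eval_policy[a] / behavior_policy[a] (assumed importance sampling estimator).
    """
    weighted = [r * pe / pb
                for r, pe, pb in zip(rewards, eval_policy, behavior_policy)]
    mean = sum(pb * w for pb, w in zip(behavior_policy, weighted))
    var = sum(pb * (w - mean) ** 2 for pb, w in zip(behavior_policy, weighted))
    return mean, var


def sample_estimate(eval_policy, behavior_policy, rewards, n, rng):
    """Average of n importance-weighted returns sampled from the behavior policy."""
    actions = range(len(rewards))
    total = 0.0
    for _ in range(n):
        a = rng.choices(actions, weights=behavior_policy)[0]
        total += rewards[a] * eval_policy[a] / behavior_policy[a]
    return total / n


if __name__ == "__main__":
    # On-policy evaluation is the special case behavior_policy == eval_policy.
    on_mean, on_var = estimator_moments(EVAL_POLICY, EVAL_POLICY, REWARDS)
    off_mean, off_var = estimator_moments(EVAL_POLICY, BEHAVIOR_POLICY, REWARDS)
    print("true value:", true_value(EVAL_POLICY, REWARDS))
    print("on-policy  mean, variance:", on_mean, on_var)
    print("off-policy mean, variance:", off_mean, off_var)
    rng = random.Random(0)
    print("sampled off-policy estimate:",
          sample_estimate(EVAL_POLICY, BEHAVIOR_POLICY, REWARDS, 100, rng))
```

In this toy setting both estimators have the same mean, so both are unbiased, but the per-sample variance under the hand-picked behavior policy is far smaller than under on-policy sampling. Because both estimators are unbiased, lower variance means lower mean squared error.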
Similar Resources
Evaluation of Monetary and Fiscal Policy Based on New Keynesian Dynamic General Equilibrium Model in Iran’s Economy
This paper examines monetary and fiscal policy through the estimation of a New-Keynesian dynamic general equilibrium model for Iran’s economy. In this New-Keynesian dynamic general equilibrium model, the consumers encounter the liquidity constraint and the firms face sticky prices, while they are changing them. In the model presented, a role is considered for both government spending and taxati...
PILCO: A Model-Based and Data-Efficient Approach to Policy Search
In this paper, we introduce PILCO, a practical, data-efficient model-based policy search method. PILCO reduces model bias, one of the key problems of model-based reinforcement learning, in a principled way. By learning a probabilistic dynamics model and explicitly incorporating model uncertainty into long-term planning, PILCO can cope with very little data and facilitates learning from scratch ...
Data-Efficient Reinforcement Learning in Continuous-State POMDPs
We present a data-efficient reinforcement learning algorithm resistant to observation noise. Our method extends the highly data-efficient PILCO algorithm (Deisenroth & Rasmussen, 2011) into partially observed Markov decision processes (POMDPs) by considering the filtering process during policy evaluation. PILCO conducts policy search, evaluating each policy by first predicting an analytic distr...
Addressing Health Equity Through Action on the Social Determinants of Health: A Global Review of Policy Outcome Evaluation Methods
Background Epidemiological evidence on the social determinants of health inequity is well-advanced, but considerably less attention has been given to evaluating the impact of public policies addressing those social determinants. Methodological challenges to produce evidence on policy outcomes present a significant barrier to mobilising policy actions for health equities. This review aims to exa...
Design and Development of a Web Database of Information Sources on Indicators for Monitoring and Evaluating Science, Technology and Innovation
So far, many indicators for evaluation of science, technology and innovation have been presented in various documents in Iran. Also, many indicators have been mentioned in the reports of international organizations. Selection and use of the indicators is difficult for policy makers and researchers because of the abundance and distribution of them in various domestic and international documents ...
Publication date: 2017